1  Introduction

Users may have a variety of tasks that give rise to issuing a
particular query. The goal of the Tasks Track at TREC 2015 was to
identify all aspects or subtasks of a user’s task as well as the
documents relevant to the entire task. This was broken into two
parts: (1) Task Understanding, which judged the relevance of key
phrases or queries to the original query (relative to a likely task
that would have given rise to both); and (2) Task Completion, which
performed document retrieval and measured usefulness to any task a
user with the query might be performing, through either a completion
measure that uses both relevance and usefulness criteria or, more
simply, an ad hoc retrieval measure of relevance alone. We submitted
a run in the Task Understanding track. In particular, since the
anchor text graph has proven useful in the general realm of query
reformulation [2], we sought to quantify the value of extracting key
phrases from anchor text in the broader setting of the Task
Understanding track.

Given a query, our approach considers a simple method for
identifying a relevant and diverse set of key phrases related to the
possible tasks that might have given rise to the query. In
particular, given the effectiveness of sessions for producing query
suggestions, as well as the fact that sessions tend to be both
topically coherent and cohesive with respect to a task, we
investigated the effectiveness of mining session co-occurrence data.
For a search engine log, session boundaries can be defined in the
typical way, but to operate over the anchor text graph we need some
notion of a “session”. We adopt the suggestion of Dang & Croft [2]
and treat different links pointing to the same document as belonging
to the same “session”. The basic assumption is that the anchor texts
of two links pointing to the same document are related via the
common reference. Note that this assumption relies on the
destination URLs of the links being the same.

Given a query, we then find matching seed candidates (link text from
the web graph or queries from search logs) and expand to related
candidate key phrases via this session assumption. The final ranking
is based on a combination of session count and the similarity of a
link to the query. Additionally, we perform several types of
filtering that prevent over-expanding the set of related queries. We
refer to the method as “having coverage” if it was able to find a
matching seed, since this is a necessary step for producing any
candidates based on co-occurrence.

Empirical results demonstrate generally good performance for the
method when it finds a matching seed. In particular, of the 34
topics judged for the Query Understanding track, our method had
coverage 62% of the time (21 topics). When the method has coverage,
the suggested key phrases are above the mean performance (by nearly
every measure reported) about two-thirds of the time and the best
performer about one-third of the time. Given its simplicity and
availability to nearly all participants, as well as the fact that
coverage can be detected before submission, it is a promising
candidate for future investigation in the track. We now describe the
method and results more fully before summarizing.

2  Query  Understanding  Task 

This section addresses the problem investigated in the Query
Understanding task of the track. That is, how can we effectively
identify all key phrases or alternate queries that might be involved
in any task a user might have which would give rise to the observed
query for a topic?

2.1  Approach  Overview 

To provide a brief overview, we remind the reader that our goal is
to investigate the effectiveness of basic session co-occurrence for
the task understanding task. For search logs, adopting the commonly
accepted session boundaries (a period of user activity demarcated by
30 minutes of inactivity) is straightforward, but to apply the same
approach to anchor text we need a notion of session. Like others
[2], for the anchor text graph we define the texts of two links to
co-occur in a “session” if the links point to the same destination
URL. Given a query, we find matching seed candidates (link text from
the web graph or queries from search logs) using a soft matching.
These seed candidates are further expanded to all queries/links
co-occurring in a session. The candidates are pruned to filter out
any globally frequent queries across all sessions, extremely long
queries, and candidates that do not meet a minimum similarity to the
original query. The final ranking is a simple product of the session
count in matched sessions and the similarity of the candidate to the
original query. We now describe the method in more detail.

2.1.1  ClueWeb12  Anchor  Text  Graph

For most interested participants, an extraction of the anchor text
graph of ClueWeb12 is easily available at
http://wwwhome.ewi.utwente.nl/~hiemstra/2013/anchor-text-for-clueweb12.html
[3]; however, we chose¹ to produce an anchor text graph directly
from the ClueWeb12 dataset.²

In particular, we used the publicly available HTMLAgilityPack³
v. 1.4.9.0. Similar to Hiemstra & Hauff [3], we only process
documents whose HTML size is less than 50K bytes, and we discard
“javascript” and “mailto” links. Additionally, we attempt to retain⁴
only links whose destination is an external site. We do this since
internal links are often either navigational or assume context. For
example, anchor text of “bothell campus” on a “University of
Washington” page is likely pointing at the home page for the
“University of Washington at Bothell”, but the simple descriptor
“bothell campus” would not be a high-quality keyphrase suggestion
absent the context. We also retain only links whose destination URL
resolves to a document in ClueWeb12 (i.e. we discard links pointing
out of ClueWeb12). This was done under the assumption that documents
within ClueWeb12 may have better crawl coverage across multiple
incoming links.
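
The link filters and the destination-based notion of a “session” can be illustrated with a minimal Python sketch. It is not the extraction code used for the run (which relied on HTMLAgilityPack); the function names, the `(source_url, dest_url, anchor_text)` tuple layout, the host-name comparison used to decide whether a link is external, and the exact-string test for membership in the collection are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def is_external_link(source_url, dest_url):
    """Heuristic (assumed): a link is external if the host names differ."""
    src_host = urlsplit(source_url).hostname
    dst_host = urlsplit(dest_url).hostname
    # Relative links have no host and are treated as internal here.
    return dst_host is not None and src_host != dst_host


def keep_link(source_url, dest_url, known_docs):
    """Apply the three filters described in the text to one link."""
    scheme = urlsplit(dest_url).scheme.lower()
    if scheme in ("javascript", "mailto"):
        return False
    if not is_external_link(source_url, dest_url):
        return False
    # Assumed: collection membership is an exact URL string match.
    return dest_url in known_docs


def build_sessions(links, known_docs):
    """Group anchor texts by destination URL; each group is one session."""
    sessions = defaultdict(list)
    for source_url, dest_url, anchor_text in links:
        if keep_link(source_url, dest_url, known_docs):
            sessions[dest_url].append(anchor_text)
    return dict(sessions)
```

As footnote 4 notes for the original pipeline, a host-name comparison of this kind will misclassify some links, for example sites that differ by URL but belong to the same organization.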

2.1.2  Normalizing  Queries  to  Filter  Phrases 

For the text in each of the query fields in the topic XML file,
queries were first normalized by collapsing runs of whitespace and
converting all characters to lower case. After this, we removed
stopwords from each query text. Rather than stopword lists that are
developed based on frequency alone, we used a list of English
function words, e.g., “a”, “about”, “the”, “to”, “who”, etc. While
function words are correlated with frequency, anecdotally we found
other published stopword lists, developed by frequency alone or by
taking the most frequent words in our corpora, to be too aggressive.
Investigating the impact of this choice is an interesting direction
for future work.

¹ Unfortunately, the anchor text distribution is via peer-to-peer
software, the use of which is procedurally complicated at our
organization.

² See http://www.lemurproject.org/clueweb12.php/ for more on
ClueWeb12.

³ http://htmlagilitypack.codeplex.com/

⁴ Many types of relative links might not be detected, as well as
sites which appear to be different by the URL but are owned by the
same organization.


As our list of English function words, we took the list of 221 words
published by Cook [1] and available for download at
http://homepage.ntlworld.com/vivian.c/Words/StructureWordsList.htm.
For a topic $t$ with a particular query $q_t$, we refer to the
output after query normalization and stopword removal as a filter
phrase, $f_t$.
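
As a minimal sketch of this normalization step, the following Python function turns a raw query into a filter phrase. The function name is hypothetical, and `FUNCTION_WORDS` is only a five-word stand-in built from the examples quoted above; the run itself used Cook's full list of 221 words.

```python
import re

# Illustrative subset only; the paper uses Cook's list of 221 words.
FUNCTION_WORDS = frozenset({"a", "about", "the", "to", "who"})


def filter_phrase(query, function_words=FUNCTION_WORDS):
    """Collapse whitespace, lower-case, and drop function words."""
    normalized = re.sub(r"\s+", " ", query).strip().lower()
    return [w for w in normalized.split(" ")
            if w and w not in function_words]
```

With this stand-in list, `filter_phrase("  How  to  Cook the   Turkey ")` yields `["how", "cook", "turkey"]`; with the full list, further words (here probably “how”) may be removed as well.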

2.1.3  Compute  Globally  Frequent  Candidates 

Any co-occurrence approach to query suggestion has to deal with
globally frequent items that co-occur with other items independently
of topic. To deal with this, we compute the top 1K most frequent
texts in the input. For example, for the ClueWeb12 anchor text
graph, the four most common anchor texts are: “next”, “permalink”,
“prev”, and “read more”. These top 1K globally frequent candidates
are pruned out later.
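
A short Python sketch of this step follows. It assumes sessions are stored as a mapping from a session key (a destination URL or a log session id) to the list of texts in that session, and that frequency means raw occurrence count; the text does not say whether occurrences or sessions are counted, so that choice and the function name are assumptions.

```python
from collections import Counter


def globally_frequent(sessions, top_k=1000):
    """Return the top_k most frequent texts over all sessions."""
    counts = Counter()
    for texts in sessions.values():
        counts.update(texts)  # assumed: count every occurrence
    return {text for text, _ in counts.most_common(top_k)}
```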

2.1.4  Matching  Filter  Phrases  to  Seed  Candidates 

After normalizing queries to filter phrases as described in Section
2.1.2, for each topic a filter phrase $f_t$ is considered to match a
candidate seed $c_t$ if the candidate is a superset of $f_t$; that
is, the candidate seed $c_t$ contains at least all of the words in
$f_t$. Because the filter phrase $f_t$ is never longer than the
original query $q_t$, the filter phrase will match seeds that
exactly match the query, partially overlap with the query (as long
as all non-function words overlap), or are supersets of the query.
The intuition behind adding supersets is that, since the goal is to
identify all possible tasks that might lead to the query, users’
queries or anchor text links often contain extra words that are
specific to some particular task. An interesting line of future work
is to separate the contribution of exact matching seeds (with and
without order), overlapping matching seeds, and superset seeds. The
set of matching sessions $S_t$, i.e. the sessions that contain a
matching candidate seed, is the basis for the remainder of our
computation for a topic $t$.
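
The superset test and the selection of matching sessions can be sketched in Python as below. The function names and the session layout (a mapping from session key to a list of texts) are assumptions, as is the decision to treat an empty filter phrase as matching nothing.

```python
def matches_seed(filter_words, candidate_text):
    """True if the candidate contains every word of the filter phrase."""
    candidate_words = set(candidate_text.lower().split())
    # Assumed: an empty filter phrase matches no seed.
    return bool(filter_words) and set(filter_words) <= candidate_words


def matching_sessions(filter_words, sessions):
    """Keep the sessions that contain at least one matching seed."""
    return {key: texts for key, texts in sessions.items()
            if any(matches_seed(filter_words, t) for t in texts)}
```

Because the test is set-based, word order is ignored, so exact matches, reordered matches, and supersets are all accepted by the same check.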

2.1.5  Expanding  to  Related  Candidates 

From the matching sessions containing a matching seed, we expand to
related candidates by removing all globally frequent candidates and
simply counting the number of sessions in $S_t$ in which each
remaining query occurs. As in the query suggestion literature,
future work could consider multiple rounds of expansion or a
weighted random walk.
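
A minimal Python sketch of this counting step, under the same assumed session layout (session key mapped to a list of texts); the function name is hypothetical.

```python
from collections import Counter


def expand_candidates(matched_sessions, frequent):
    """Count, per candidate, the matching sessions it occurs in."""
    counts = Counter()
    for texts in matched_sessions.values():
        # A text repeated within one session is counted once.
        counts.update(set(texts) - set(frequent))
    return counts
```

Here `matched_sessions` plays the role of $S_t$ and `frequent` is the set of globally frequent texts from Section 2.1.3.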

2.1.6  Filtering  and  Similarity  Weighting 

The expansion to related candidates based on co-occurrence has
several common types of failure. To be conservative and attempt to
eliminate these failures, we require a candidate to overlap with the
filter phrase for a topic and to meet a length restriction (very
long texts tend to match spuriously). In particular, for every topic
$t$, candidate keyphrase $k_t$, and filter phrase $f_t$:

1. $k_t$ is discarded if $\cos(k_t, f_t) \le 0$ (i.e. there must be
at least one word of overlap).

2. $k_t$ is discarded if its length is longer than a multiple of the
length of $f_t$, i.e. if $|k_t| > L \cdot |f_t|$. We choose $L = 4$
to allow a generous but not extreme upper bound. This condition gets
rid of extremely long candidates, usually pastes in search logs or
bad tag closures in anchor text.

To produce the final score, it is intuitive that not only should the
frequency of occurrence in matching sessions matter, but the
candidate’s similarity to the query is also likely to be important.
To account for this, we weight the count of occurrences in matching
sessions, $s_{k_t}$, by the similarity to the filter phrase before
normalization. More precisely, we:

1. Scale the matching session occurrence count for the keyphrase by
the similarity to the filter phrase:

$$\tilde{s}_{k_t} = \cos(k_t, f_t) \cdot s_{k_t}.$$

2. Normalize the final score by the max in the topic:

$$\hat{s}_{k_t} = \frac{\tilde{s}_{k_t}}{\max_{k'_t} \tilde{s}_{k'_t}}.$$

This simply scales the final score to the [0, 1] range and does not
alter the final ranking.
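
The two filters and the two scoring steps of this section can be sketched together in Python. The sketch assumes a cosine over raw bag-of-words term counts and a length measured in words; neither the term weighting nor the length unit is stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a_words, b_words):
    """Cosine similarity over raw term counts (assumed weighting)."""
    a, b = Counter(a_words), Counter(b_words)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def score_candidates(session_counts, filter_words, max_ratio=4):
    """Filter candidates, scale counts by similarity, normalize by max."""
    scaled = {}
    for text, count in session_counts.items():
        words = text.split()
        sim = cosine(words, filter_words)
        if sim <= 0:
            continue  # no word overlap with the filter phrase
        if len(words) > max_ratio * len(filter_words):
            continue  # longer than L times the filter phrase (in words)
        scaled[text] = sim * count
    top = max(scaled.values(), default=0.0)
    return {t: s / top for t, s in scaled.items()} if top else {}
```

For the filter phrase `["paris", "flights"]`, a candidate “paris flights” seen in 2 matching sessions gets a scaled score of 2.0, while “cheap paris flights deals” seen in 4 sessions gets about 0.707 × 4 ≈ 2.83 and therefore ranks first with a normalized score of 1.0.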

2.2  Evaluation 

Table 1 reports, for each performance measure, the mean across all
topics of the difference from the per-topic mean (i.e. positive
means above the mean overall and negative below it). Table 2 reports
the same values but only over the 21 of the 34 judged topics where
the method had coverage; in the remaining 13, a matching seed was
not found. Table 3 reports, for the 21 topics where the method had
coverage, the number of topics on which the method was the best
performer or above average.
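
To make the quantity reported in Tables 1 and 2 concrete, the following Python sketch computes the mean difference from the per-topic mean, optionally restricted to covered topics. The function name and the dictionary layout are assumptions, and the sketch does not settle how topics without coverage were scored; it simply uses whatever per-topic score is supplied.

```python
from statistics import mean


def mean_difference_from_topic_mean(our_scores, topic_means, covered=None):
    """Average over topics of (our score - mean score for that topic)."""
    topics = [t for t in our_scores if covered is None or t in covered]
    return mean(our_scores[t] - topic_means[t] for t in topics)
```

Calling it with `covered=None` corresponds to the Table 1 setting, and passing the set of covered topics corresponds to the Table 2 setting.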

As can be seen, overall the method falls below the mean across all
topics, but Table 2 shows that this is because the method sometimes
lacks coverage. Since a lack of coverage can be easily detected, the
potential of this method for use in combination with other
techniques is better represented by its performance when it has
coverage. In those cases, the mean across topics is quite positive,
with (as seen in Table 3) the method performing above the mean about
two-thirds of the time (14 to 15 of the 21 topics, depending on the
measure) and obtaining the best performance about one-third of the
time (6 to 7 topics). Overall, this speaks well to the potential for
combining this method with techniques used by other participants.


3  Conclusions

We described a simple approach that can be equally applied to either
search logs or the anchor text graph for finding keyphrases for
related tasks to a query. The method relies on a simple matching
procedure to find starting seeds and uses either common destinations
in the anchor text graph or session co-occurrence in a search log to
find related candidates. Simple filtering steps are applied to
remove globally frequent candidates as well as candidate keyphrases
that have no similarity to the core of the original query.

Overall, empirical results demonstrate generally good performance
for the method when it finds a matching seed. In particular, of the
34 topics judged for the Query Understanding track, our method had
coverage 62% of the time (21 topics). When the method has coverage,
the suggested key phrases are above the mean performance (by nearly
every measure reported) about two-thirds of the time and the best
performer about one-third of the time. Given its simplicity and
availability to nearly all participants, as well as the fact that
coverage can be detected before submission, it is a promising
candidate for future investigation in the track.

References

[1] V. Cook. Designing a basic parser for CALL. CALICO Journal,
6(1):50-67, 1988.
http://homepage.ntlworld.com/vivian.c/Writings/Papers/CalicoPaper88.htm.

[2] V. Dang and W. Croft. Query reformulation using anchor text. In
WSDM 2010, pages 41-50, 2010.

[3] D. Hiemstra and C. Hauff. MIREX: MapReduce information retrieval
experiments. Technical Report TR-CTIT-10-15, Centre for Telematics
and Information Technology, University of Twente, 2010.
ISSN 1381-3625.

ERR-IA@10  ERR-IA@20  ERR-IA@1000  α-nDCG@10  α-nDCG@20  α-nDCG@1000
  -0.0154    -0.0219      -0.0232    -0.0177    -0.0446      -0.0681

Table 1: Overall Average of Difference from Mean Topic.


ERR-IA@10  ERR-IA@20  ERR-IA@1000  α-nDCG@10  α-nDCG@20  α-nDCG@1000
   0.0959     0.0894       0.0882     0.1126     0.0871       0.0632

Table 2: Overall Average of Difference from Mean Topic when Coverage.


        ERR-IA@10  ERR-IA@20  ERR-IA@1000  α-nDCG@10  α-nDCG@20  α-nDCG@1000
Max             7          7            7          7          6            6
> Mean         15         15           15         15         15           14

Table 3: Times best and better than mean when coverage.


